DMS: Create deployment and version during bundle deploy #5386

Open
shreyas-goenka wants to merge 1 commit into main from shreyas-goenka/bundle-dms-implementation
@shreyas-goenka shreyas-goenka commented May 31, 2026

Contributor

We need to track deployment state on the server. This PR just creates the scaffolding version and deployment when a user approves the plan to apply it.

Note: the lineage is already persisted in the WAL, and we always read it from the local WAL during replay, so the risk of orphaned deployments (for example, when the first create request fails) is minimized.

@shreyas-goenka shreyas-goenka changed the title from "bundle/deploy/lock: add DMS-backed DeploymentManager implementation" to "bundle/deploy/lock: add DMS-backed DeploymentManager implementation using SDK bundle client" on May 31, 2026
eng-dev-ecosystem-bot commented May 31, 2026

Collaborator

Integration test report

Commit: caeb738

Run: 28368152735

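| Env | 🟨 KNOWN | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|
| 🟨 aws linux | 7 | 2 | 1 | 13 | 230 | 1039 | 5:50 |
| 🟨 aws windows | 7 | | 1 | 13 | 234 | 1037 | 6:35 |
| 💚 aws-ucws linux | | | 8 | 13 | 316 | 957 | 4:48 |
| 💚 aws-ucws windows | | | 8 | 13 | 318 | 955 | 3:59 |
| 💚 azure linux | | | 2 | 15 | 232 | 1038 | 4:00 |
| 💚 azure windows | | | 2 | 15 | 234 | 1036 | 3:47 |
| 💚 azure-ucws linux | | | 2 | 15 | 318 | 954 | 5:12 |
| 💚 azure-ucws windows | | | 2 | 15 | 320 | 952 | 3:50 |
| 💚 gcp linux | | | 2 | 15 | 231 | 1040 | 3:29 |
| 💚 gcp windows | | | 2 | 15 | 233 | 1038 | 3:20 |
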
23 interesting tests: 13 SKIP, 7 KNOWN, 2 flaky, 1 RECOVERED

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨 K | 🟨 K | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🔄 TestFilerWorkspaceNotebook | 🔄 f | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |
| 🔄 TestFilerWorkspaceNotebook/sqlNb.sql | 🔄 f | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |
| 💚 TestFetchRepositoryInfoAPI_FromRepo | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |

Top 4 slowest tests (at least 2 minutes):

| duration | env | testname |
|---|---|---|
| 2:47 | azure windows | TestAccept |
| 2:38 | azure-ucws windows | TestAccept |
| 2:26 | aws-ucws windows | TestAccept |
| 2:24 | gcp windows | TestAccept |

@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from 98dc0c7 to 4a4382f Compare June 1, 2026 12:33
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from 4a4382f to de9adfd Compare June 1, 2026 12:35
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from de9adfd to bb16fcd Compare June 1, 2026 15:11
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from bb16fcd to bdc7ba2 Compare June 1, 2026 15:16
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from bdc7ba2 to 1e8ba45 Compare June 1, 2026 15:23
@shreyas-goenka shreyas-goenka changed the title from "bundle/deploy/lock: add DMS-backed DeploymentManager implementation using SDK bundle client" to "bundle/deploy/lock: record deployment history in DMS behind experimental.record_deployment_history" on Jun 1, 2026
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-lock-abstraction branch from ff910cd to ee860e8 Compare June 1, 2026 15:31
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from 1e8ba45 to 76bf017 Compare June 1, 2026 15:31
@shreyas-goenka shreyas-goenka requested a review from denik June 1, 2026 17:02

// The server validates that versionID equals last_version_id + 1 and returns
// ABORTED otherwise (e.g. a concurrent deploy already created this version).
version, versionErr := svc.CreateVersion(ctx, sdkbundle.CreateVersionRequest{

@shreyas-goenka shreyas-goenka Jun 1, 2026

Contributor Author

This will not work well when the plan is serialized and potentially outdated, because we do not use the serial here.

Will be fixed in a followup.

@shreyas-goenka

Contributor Author

We can remove the traditional file based lock in a followup. Not necessary for now / preview.

Comment thread bundle/phases/deploy.go Outdated
Comment thread bundle/deploy/lock/deployment_metadata_service.go Outdated
return fmt.Errorf("failed to parse version ID %q: %w", versionID, err)
}
r.versionNum = versionNum
r.stopHeartbeat = startHeartbeat(ctx, r.svc, r.deploymentID, versionID)

Contributor

We call CreateVersion twice, in deploy and in destroy, and seem to start the heartbeat twice. Should we have only one heartbeat instance?

Contributor Author

Can you clarify? Those are independent code paths, and both need a heartbeat, right?

Contributor

Ah right, my bad, those are indeed separate processes.

@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from 607bdd0 to 98f4444 Compare June 15, 2026 13:40
Comment thread bundle/phases/deploy.go Outdated
Comment thread bundle/phases/deploy.go Outdated
bundle.ApplyContext(ctx, b, lock.Release(lock.GoalDeploy))
}()
if err := recorder.CreateVersion(ctx); err != nil {

Contributor

When we remove the file lock later on, how do we ensure there is no race condition that creates multiple versions? We don't seem to do any synchronisation or locking here now. Is that part of a follow-up?

Contributor Author

The CreateVersion implicitly has locking semantics. Only one client can have a "live" version that is in progress at a time.

Contributor Author

The server-side version counter is the synchronization: CreateVersion only succeeds when the requested version is last_version_id + 1, otherwise it returns ABORTED (409). So even without the file lock, two concurrent deploys racing to create the next version would have one win and the other get ABORTED. Surfacing that as a clean user-facing error (retry/serial handling) is the follow-up I noted.

Contributor

Got it, makes sense. Could you add an acceptance test for it, though, to make sure this behaviour is recorded?

@andrewnester andrewnester self-requested a review June 16, 2026 12:36
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from 98f4444 to 87c99e8 Compare June 17, 2026 00:27
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/bundle-dms-implementation branch from 87c99e8 to 9a7feb1 Compare June 17, 2026 00:28
if db.Data.Lineage == "" {
db.Data.Lineage = uuid.New().String()
}
return db.Data.Lineage

Contributor

This is initialized in Open() below.

Contributor Author

For direct, yes. In the case of DMS, this is initialized at plan time here instead. I did not want to touch direct deployment code paths.

@shreyas-goenka shreyas-goenka Jun 25, 2026

Contributor Author

We now have a common function to initialize the lineage (GetOrInitializeLineage). I confirmed that this is stored in the WAL.
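
A get-or-initialize helper of this kind can be sketched as follows. This is an illustration only, not the PR's function: `stateData` and `getOrInitLineage` are hypothetical names, and random bytes from `crypto/rand` stand in for the UUID that the reviewed snippet generates.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// stateData is a hypothetical stand-in for the persisted deployment state.
type stateData struct {
	Lineage string
}

// getOrInitLineage returns the existing lineage, or generates and stores a new
// random one on first use. Later calls return the same value.
func getOrInitLineage(d *stateData) (string, error) {
	if d.Lineage == "" {
		buf := make([]byte, 16)
		if _, err := rand.Read(buf); err != nil {
			return "", fmt.Errorf("failed to generate lineage: %w", err)
		}
		d.Lineage = hex.EncodeToString(buf)
	}
	return d.Lineage, nil
}

func main() {
	d := &stateData{}
	first, _ := getOrInitLineage(d)
	second, _ := getOrInitLineage(d)
	fmt.Println("stable across calls:", first == second)
}
```

Keeping the initialization in one place means every caller observes the same lineage once it has been written, which is the property the thread relies on.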

Comment thread libs/dms/recorder.go Outdated
if parseErr != nil {
return "", fmt.Errorf("failed to parse last_version_id %q: %w", dep.LastVersionId, parseErr)
}
versionID = strconv.FormatInt(lastVersion+1, 10)

Contributor Author

We'll need to update this logic to also consider the existing serial numbers for migration scenarios. That'll be a followup.
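
For reference, the computation under review amounts to parsing a decimal string and formatting its successor. A standalone sketch, with `nextVersionID` as a hypothetical name and without the migration handling mentioned above:

```go
package main

import (
	"fmt"
	"strconv"
)

// nextVersionID parses a decimal last_version_id and returns the next ID as a
// string. It does not consider existing serial numbers.
func nextVersionID(lastVersionID string) (string, error) {
	last, err := strconv.ParseInt(lastVersionID, 10, 64)
	if err != nil {
		return "", fmt.Errorf("failed to parse last_version_id %q: %w", lastVersionID, err)
	}
	return strconv.FormatInt(last+1, 10), nil
}

func main() {
	next, err := nextVersionID("41")
	fmt.Println(next, err) // prints "42 <nil>"
}
```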

Comment thread bundle/phases/destroy.go
// Set up DMS recording of this destroy as a version. The version is not
// created until the destroy is approved (below), so a cancelled destroy
// records nothing; the deferred CompleteVersion is a no-op until then.
recorder := newDeploymentRecorder(ctx, b, engine, dms.VersionTypeDestroy)

Contributor

Serial was already fixed when we created the plan, we should read it from there, not from the backend.

For clarity, we should also avoid storing serial/lineage on recorder object and just pass it to CreateVersion() directly from the plan, so that we have one source of truth for this info.

Contributor Author

Good callout, done. We use the serial from the plan. In a followup I'll add support for storing and reading previous_version_id from the plan as well, since that requires an SDK version update (to get previous_version_id).
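
The shape agreed on in this thread (lineage and serial passed in from the plan, the version created only after approval, and a deferred completion that does nothing until then) can be sketched as below. All types and names here (`plan`, `recorder`, `run`) are hypothetical simplifications, not the PR's API.

```go
package main

import (
	"errors"
	"fmt"
)

// plan is a hypothetical stand-in carrying the values fixed at plan time.
type plan struct {
	Lineage string
	Serial  int64
}

// recorder is a hypothetical, in-memory stand-in for the DMS recorder. It
// stores no lineage or serial of its own; both arrive from the plan.
type recorder struct {
	created bool
	log     []string
}

// CreateVersion records a version using the plan's lineage and serial.
func (r *recorder) CreateVersion(lineage string, serial int64) error {
	if r == nil {
		return nil // recording disabled
	}
	if lineage == "" {
		return errors.New("plan has no lineage")
	}
	r.created = true
	r.log = append(r.log, fmt.Sprintf("create %s/%d", lineage, serial))
	return nil
}

// CompleteVersion is a no-op until CreateVersion has succeeded.
func (r *recorder) CompleteVersion(success bool) {
	if r == nil || !r.created {
		return
	}
	r.log = append(r.log, fmt.Sprintf("complete success=%t", success))
}

// run defers completion up front, but creates the version only when approved.
func run(r *recorder, p plan, approved bool) (err error) {
	defer func() { r.CompleteVersion(err == nil) }()
	if !approved {
		return nil // cancelled: nothing is recorded
	}
	if err = r.CreateVersion(p.Lineage, p.Serial); err != nil {
		return err
	}
	// Applying the plan would happen here.
	return nil
}

func main() {
	declined := &recorder{}
	_ = run(declined, plan{Lineage: "abc", Serial: 3}, false)
	approved := &recorder{}
	_ = run(approved, plan{Lineage: "abc", Serial: 3}, true)
	fmt.Println(len(declined.log), approved.log)
}
```

Passing the values as arguments keeps the plan as the single source of truth, which is the point raised in the review comment.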

Comment thread bundle/phases/deploy.go
// Record the DMS version now that the plan is approved (a cancelled deploy
// records nothing). The deployment lineage and the version's serial both
// come from the plan; CompleteVersion below finalizes this same version.
if err := recorder.CreateVersion(ctx, plan.Lineage, plan.Serial); err != nil {

@shreyas-goenka shreyas-goenka Jun 29, 2026

Contributor Author

In a followup, we'll also store and set the previous serial in the plan to ensure serializability.

Records each approved deploy/destroy as a version with the Deployment Metadata
Service (DMS), gated by experimental.record_deployment_history and the direct
engine. The version is created only after the plan is approved — a cancelled or
declined deploy/destroy records nothing, so there are no empty/abandoned
versions for operations that never ran.

- libs/dms: Recorder with CreateVersion / CompleteVersion. The deployment ID is
  the state lineage (from GetOrInitLineage), so a bundle deployment maps
  one-to-one to a DMS deployment record. GetDeployment first, CreateDeployment
  only when missing, then create the next version; heartbeat keeps the version's
  lease alive; CompleteVersion records success/failure and, for destroy, deletes
  the deployment record on success. Independent of bundle/lock.
- phases: newDeploymentRecorder builds the recorder from the bundle (nil unless
  the flag is set and the engine is direct); deploy/destroy create the version
  inside the approved branch (after UpgradeToWrite, so the lineage is already in
  the WAL) and complete it in the deferred lock release.
- libs/testserver: in-memory DMS handlers under /api/2.0/bundle/...
- acceptance/bundle/dms: deploy/redeploy/destroy record versions and hold the
  file lock; redeploy after deleting .databricks recovers the lineage from
  remote state; enabling the flag after a plain deploy creates a new deployment;
  a declined destroy records no version and does not delete the deployment.

Co-authored-by: Shreyas Goenka <shreyas.goenka@databricks.com>
Comment thread libs/dms/recorder.go
// Package dms records bundle deployments as versions with the Deployment
// Metadata Service (DMS).
//
// It is intentionally independent of the deployment lock: a Recorder does not

Contributor Author

In a followup we can deprecate the file-based lock. Out of scope for now.

5 participants